34 research outputs found

    GPUs as Storage System Accelerators

    Full text link
    Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file, and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications. Comment: IEEE Transactions on Parallel and Distributed Systems, 201
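    As a rough illustration of the hashing-based primitives described above, the sketch below (Python, not the authors' GPU code) shows fixed-size chunking plus per-chunk hashing to detect similarity between two versions of a file; the prototype offloads the hashing step to the GPU, whereas the sketch uses CPU-side hashlib, and the chunk size and hash function are assumptions.

```python
# Minimal sketch of content-addressable similarity detection between two file
# versions: split each version into fixed-size chunks, hash every chunk, and
# count how many chunks of the new version already exist in the old one.
# The paper offloads the hashing step to the GPU; hashlib on the CPU is used
# here purely for illustration. Chunk size and hash choice are assumptions.
import hashlib

CHUNK_SIZE = 256 * 1024  # assumed chunk size (256 KB)

def chunk_hashes(data: bytes) -> list[str]:
    """Split data into fixed-size chunks and hash each chunk."""
    return [
        hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]

def similarity(old_version: bytes, new_version: bytes) -> float:
    """Fraction of the new version's chunks that are unchanged from the old one."""
    old_set = set(chunk_hashes(old_version))
    new_hashes = chunk_hashes(new_version)
    if not new_hashes:
        return 1.0
    return sum(1 for h in new_hashes if h in old_set) / len(new_hashes)
```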

    Active Data: A Data-Centric Approach to Data Life-Cycle Management

    Get PDF
    Data-intensive science offers new opportunities for innovation and discoveries, provided that large datasets can be handled efficiently. Data management for data-intensive science applications is challenging: it requires support for complex data life cycles, coordination across multiple sites, fault tolerance, and scalability to tens of sites and petabytes of data. In this paper, we argue that data management for data-intensive science applications requires a fundamentally different approach than the current ad-hoc, task-centric one. We propose Active Data, a fundamentally novel paradigm for data life-cycle management. Active Data follows two principles: it is data-centric and event-driven. We report on the Active Data programming model and its preliminary implementation, and discuss the benefits and limitations of the approach on recognized challenging data-intensive science use cases.
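    To make the two principles concrete, here is a minimal, hypothetical sketch of a data-centric, event-driven life cycle: a data item carries its own state machine and handlers subscribe to state changes. The states, transitions, and publish/subscribe mechanics are illustrative assumptions, not the Active Data API.

```python
# Hypothetical sketch of a data-centric, event-driven life cycle (not the
# Active Data implementation): each data item owns a small state machine and
# notifies subscribed handlers on every legal transition.
from collections import defaultdict
from typing import Callable

# Assumed life-cycle transitions for a data item.
TRANSITIONS = {
    "created": {"replicated", "deleted"},
    "replicated": {"archived", "deleted"},
    "archived": {"deleted"},
}

class LifeCycle:
    def __init__(self, item_id: str):
        self.item_id = item_id
        self.state = "created"
        self.handlers: dict[str, list[Callable]] = defaultdict(list)

    def on(self, state: str, handler: Callable) -> None:
        """Subscribe a handler to run when the item enters `state`."""
        self.handlers[state].append(handler)

    def transition(self, new_state: str) -> None:
        """Move the item to a new state and notify subscribers."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        for handler in self.handlers[new_state]:
            handler(self.item_id, new_state)

# Example: trigger an integrity check whenever a dataset gets replicated.
lc = LifeCycle("dataset-42")
lc.on("replicated", lambda item, state: print(f"verify checksums of {item}"))
lc.transition("replicated")
```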

    Interacting with Large Distributed Datasets using Sketch

    Get PDF
    We present Sketch, a distributed software infrastructure for building interactive tools for exploring large datasets distributed across multiple machines. We have built three sophisticated applications using this framework: a billion-row spreadsheet, a distributed log browser, and a distributed-systems performance debugging tool. Sketch applications allow interactive and responsive exploration of complex distributed datasets, scaling gracefully to large system sizes. The conflicting constraints of large-scale data and the small timescales required by human interaction are difficult to satisfy simultaneously. Sketch hits a sweet spot in this trade-off by exploiting the observation that the precision of a data view is limited by the resolution of the user's screen. The system pushes data reduction operations to the data sources. The core Sketch abstraction provides a narrow programming interface; Sketch clients construct a distributed application by stacking modular components with identical interfaces, each providing a useful feature: network transparency, concurrency, fault tolerance, straggler avoidance, round-trip reduction, and distributed aggregation.
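    The screen-resolution argument can be sketched in a few lines (a hypothetical illustration, not Sketch's interface): each source reduces its values to a histogram with one bucket per pixel column, and the client only merges the small per-source histograms.

```python
# Illustrative sketch of pushing data reduction to the sources: bin values at
# screen resolution on each source, then merge tiny histograms at the client.
# Bucket counts and the merge step are assumptions, not Sketch's actual API.
def local_histogram(values: list[float], lo: float, hi: float, pixels: int) -> list[int]:
    """Runs at a data source: bin values into one bucket per pixel column."""
    counts = [0] * pixels
    width = (hi - lo) / pixels
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts

def merge(histograms: list[list[int]]) -> list[int]:
    """Runs at the client: combine per-source histograms bucket by bucket."""
    return [sum(bucket) for bucket in zip(*histograms)]

# Two sources and an 800-pixel-wide view: only 2 x 800 integers cross the
# network, regardless of how many rows each source holds.
h1 = local_histogram([0.1, 0.2, 5.0], lo=0.0, hi=10.0, pixels=800)
h2 = local_histogram([5.1, 9.9], lo=0.0, hi=10.0, pixels=800)
view = merge([h1, h2])
```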

    Embracing diversity: optimizing distributed storage systems for diverse deployment environments

    No full text
    Distributed storage system middleware acts as a bridge between the upper-layer applications and the lower-layer storage resources available in the deployment platform. Storage systems are expected to efficiently support the applications' workloads while reducing the cost of the storage platform. In this context, two factors increase the complexity of storage system design. First, the applications' workloads are diverse along a number of axes: read/write access patterns, data compressibility, and security requirements, to mention only a few. Second, the storage system should provide high performance within a given dollar budget. This dissertation addresses two interrelated issues in this design space. First, can the computational power of commodity massively multicore devices be exploited to accelerate storage system operations without increasing the platform cost? Second, is it possible to build a storage system that can support a diverse set of applications yet be optimized for each one of them? This work provides evidence that, for some system designs and workloads, significant performance gains can be obtained by exploiting massively multicore devices and by optimizing the storage system for a specific application. Further, my work demonstrates that these gains are possible while still supporting the POSIX API and without requiring changes to the application. Finally, while these two issues can be addressed independently, a system that includes solutions to both of them enables significant synergies.
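    One way to picture "optimized for each application" is a mapping from a declared workload profile to storage-system knobs; the profiles and settings below are invented for illustration and are not prescribed by the dissertation.

```python
# Hypothetical per-application tuning: map a workload profile to storage knobs
# such as chunk size, replication factor, and checksumming. The profile names
# and values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StorageConfig:
    chunk_size_kb: int
    replication: int
    checksum: bool

PROFILES = {
    "checkpointing": StorageConfig(chunk_size_kb=1024, replication=1, checksum=False),
    "read_mostly": StorageConfig(chunk_size_kb=256, replication=3, checksum=True),
    "sensitive_data": StorageConfig(chunk_size_kb=256, replication=2, checksum=True),
}

def configure(workload: str) -> StorageConfig:
    """Pick a configuration for the mounted application, with a safe default."""
    return PROFILES.get(workload, StorageConfig(chunk_size_kb=512, replication=2, checksum=True))

print(configure("checkpointing"))
```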

    Beyond music sharing: an evaluation of peer-to-peer data dissemination techniques in large scientific collaborations

    No full text
    The avalanche of data from scientific instruments and the ensuing interest from geographically distributed users to analyze and interpret it accentuates the need for efficient data dissemination. An optimal data distribution scheme will find the delicate balance between conflicting requirements of minimizing transfer times, minimizing the impact on the network, and uniformly distributing load among participants. We identify several data distribution techniques, some successfully employed by today's peer-to-peer networks: staging, data partitioning, orthogonal bandwidth exploitation, and combinations of the above. We use simulations to explore the performance of these techniques in contexts similar to those used by today's data-centric scientific collaborations and derive several recommendations for efficient data dissemination. Our experimental results show that the peer-to-peer solutions that offer load balancing and good fault tolerance properties and have embedded participation incentives lead to unjustified costs in today's scientific data collaborations deployed on over-provisioned network cores. However, as user communities grow and these deployments scale, peer-to-peer data delivery mechanisms will likely outperform other techniques.
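    A back-of-envelope model (not the paper's simulator) contrasts two of the techniques named above: staging everything through one node versus partitioning the file so peers exchange distinct chunks; bandwidths and node counts are illustrative assumptions.

```python
# Crude bandwidth model comparing staging with data partitioning; all numbers
# are assumptions for illustration only.
def staging_time_s(file_gb: float, clients: int, source_gbps: float) -> float:
    """Every client pulls the full file through one staging node's uplink."""
    return file_gb * 8 * clients / source_gbps

def partitioned_time_s(file_gb: float, clients: int,
                       source_gbps: float, peer_gbps: float) -> float:
    """Source pushes each chunk once; peers then swap the chunks they miss."""
    seed_s = file_gb * 8 / source_gbps                       # one copy leaves the source
    swap_s = file_gb * 8 * (clients - 1) / (clients * peer_gbps)
    return seed_s + swap_s

print(staging_time_s(100, clients=20, source_gbps=10))                     # 1600.0 s
print(partitioned_time_s(100, clients=20, source_gbps=10, peer_gbps=10))   # 156.0 s
```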

    Configurable Security for Scavenged Storage Systems

    No full text
    Scavenged storage systems harness unused disk space from individual workstations the same way idle CPU cycles are harnessed by desktop grid applications lik